Recursive Semantic Anchoring in ISO 639:2023: A Structural Extension to ISO/TC 37 Frameworks

Kilictas, Bugra, Alpay, Faruk

arXiv.org Artificial Intelligence

ISO 639:2023 unifies the ISO language-code family and introduces contextual metadata, but it lacks a machine-native mechanism for handling dialectal drift and creole mixtures. We propose a formalisation of recursive semantic anchoring, attaching to every language entity $\chi$ a family of fixed-point operators $\phi_{n,m}$ that model bounded semantic drift via the relation $\phi_{n,m}(\chi) = \chi \oplus \Delta(\chi)$, where $\Delta(\chi)$ is a drift vector in a latent semantic manifold. The base anchor $\phi_{0,0}$ recovers the canonical ISO 639:2023 identity, whereas $\phi_{99,9}$ marks the maximal drift state that triggers a deterministic fallback. Using category theory, we treat the operators $\phi_{n,m}$ as morphisms and drift vectors as arrows in a category $\mathrm{DriftLang}$. A functor $\Phi: \mathrm{DriftLang} \to \mathrm{AnchorLang}$ maps every drifted object to its unique anchor and proves convergence. We provide an RDF/Turtle schema (BaseLanguage, DriftedLanguage, ResolvedAnchor) and worked examples -- e.g., $\phi_{8,4}$ (Standard Mandarin) versus $\phi_{8,7}$ (a colloquial variant), and $\phi_{1,7}$ for Nigerian Pidgin anchored to English. Experiments with transformer models show higher accuracy in language identification and translation on noisy or code-switched input when the $\phi$-indices are used to guide fallback routing. The framework is compatible with ISO/TC 37 and provides an AI-tractable, drift-aware semantic layer for future standards.
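
A minimal Python sketch of the anchoring idea (the names, vector representation, and fallback rule here are illustrative assumptions, not the paper's implementation):

    import numpy as np

    MAX_DRIFT = (99, 9)  # assumed index of the maximal drift state

    def phi(chi: np.ndarray, delta: np.ndarray, n: int, m: int) -> np.ndarray:
        """Drift operator phi_{n,m}(chi) = chi (+) Delta(chi), with (+)
        modelled here as vector addition in a latent semantic space."""
        if (n, m) == (0, 0):
            return chi                  # base anchor: canonical ISO 639:2023 identity
        if (n, m) >= MAX_DRIFT:
            return resolve_anchor(chi)  # deterministic fallback at maximal drift
        return chi + delta

    def resolve_anchor(chi: np.ndarray) -> np.ndarray:
        # Stand-in for the functor Phi: DriftLang -> AnchorLang, which maps
        # every drifted object back to its unique anchor.
        return chi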


Learn and Don't Forget: Adding a New Language to ASR Foundation Models

Qian, Mengjie, Tang, Siyuan, Ma, Rao, Knill, Kate M., Gales, Mark J. F.

arXiv.org Artificial Intelligence

Foundation ASR models often support many languages, e.g. 100 languages in Whisper. However, there has been limited work on integrating an additional, typically low-resource, language while maintaining performance on the original language set. Fine-tuning, while simple, may degrade the accuracy of the original set. We compare three approaches that exploit adaptation parameters: soft language code tuning, which trains only the language code; soft prompt tuning, which trains prepended tokens; and LoRA, in which a small set of additional parameters is optimised. Elastic Weight Consolidation (EWC) offers an alternative compromise, with the potential to maintain performance in specific target languages. Results show that direct fine-tuning yields the best performance for the new language but degrades existing language capabilities. EWC can address this issue for specific languages. If only adaptation parameters are used, the existing language capabilities are maintained, but at the cost of performance in the new language.
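
A minimal PyTorch sketch of the EWC penalty mentioned above (variable names are illustrative; the paper's exact formulation may differ):

    import torch

    def ewc_penalty(model, fisher, old_params, lam=1.0):
        """Penalise movement of parameters that carried high Fisher
        information for the original languages."""
        loss = 0.0
        for name, param in model.named_parameters():
            if name in fisher:
                loss = loss + (fisher[name] * (param - old_params[name]) ** 2).sum()
        return 0.5 * lam * loss

    # total = asr_loss_on_new_language + ewc_penalty(model, fisher, old_params)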


Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation

Frohmann, Markus, Sterner, Igor, Vulić, Ivan, Minixhofer, Benjamin, Schedl, Markus

arXiv.org Artificial Intelligence

Segmenting text into sentences plays an early and crucial role in many NLP systems. This is commonly achieved with rule-based or statistical methods that rely on lexical features such as punctuation. Although some recent works no longer rely exclusively on punctuation, we find that no prior method achieves all of (i) robustness to missing punctuation, (ii) effective adaptability to new domains, and (iii) high efficiency. We introduce a new model - Segment Any Text (SaT) - to solve this problem. To enhance robustness, we propose a new pretraining scheme that ensures less reliance on punctuation. To address adaptability, we introduce an extra stage of parameter-efficient fine-tuning, establishing state-of-the-art performance in distinct domains such as verses from lyrics and legal documents. Along the way, we introduce architectural modifications that yield a threefold gain in speed over the previous state of the art and eliminate a spurious reliance on context far in the future. Finally, we introduce a variant of our model fine-tuned on a diverse, multilingual mixture of sentence-segmented data, acting as a drop-in replacement and enhancement for existing segmentation tools. Overall, our contributions provide a universal approach for segmenting any text. Our method outperforms all baselines - including strong LLMs - across 8 corpora spanning diverse domains and languages, especially in practically relevant situations where text is poorly formatted. Our models and code, including documentation, are available at https://huggingface.co/segment-any-text under the MIT license.
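
The models are distributed through the authors' wtpsplit package; a minimal usage sketch (the checkpoint name is an assumption, see the linked page for available models):

    from wtpsplit import SaT

    # load a small SaT checkpoint from the Hugging Face hub
    sat = SaT("sat-3l-sm")

    # segment poorly formatted text without relying on punctuation
    sentences = sat.split("this is a test this is another test")
    print(sentences)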


ParaNames 1.0: Creating an Entity Name Corpus for 400+ Languages using Wikidata

Sälevä, Jonne, Lignos, Constantine

arXiv.org Artificial Intelligence

We introduce ParaNames, a massively multilingual parallel name resource consisting of 140 million names spanning over 400 languages. Names are provided for 16.8 million entities, and each entity is mapped from a complex type hierarchy to a standard type (PER/LOC/ORG). Using Wikidata as a source, we create the largest resource of this type to date. We describe our approach to filtering and standardizing the data to provide the best quality possible. ParaNames is useful for multilingual language processing, both in defining tasks for name translation/transliteration and as supplementary data for tasks such as named entity recognition and linking. We demonstrate the usefulness of ParaNames on two tasks. First, we perform canonical name translation between English and 17 other languages. Second, we use it as a gazetteer for multilingual named entity recognition, obtaining performance improvements on all 10 languages evaluated.
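
A short Python sketch of the gazetteer use case (the column names are assumptions about the release format; consult the ParaNames documentation for the actual schema):

    import csv
    from collections import defaultdict

    def load_gazetteer(path, language):
        """Map surface names to entity types (PER/LOC/ORG) for one language."""
        gazetteer = defaultdict(set)
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f, delimiter="\t"):
                if row["language"] == language:  # assumed column names
                    gazetteer[row["name"]].add(row["type"])
        return gazetteer

    # gaz = load_gazetteer("paranames.tsv", "sw")
    # gaz.get("Nairobi")  # -> {"LOC"}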


Fumbling in Babel: An Investigation into ChatGPT's Language Identification Ability

Chen, Wei-Rui, Adebara, Ife, Doan, Khai Duy, Liao, Qisheng, Abdul-Mageed, Muhammad

arXiv.org Artificial Intelligence

Recently, ChatGPT has emerged as a powerful NLP tool that can carry out a variety of tasks. However, the range of languages ChatGPT can handle remains largely a mystery. In this work, we investigate ChatGPT's language identification abilities. For this purpose, we compile Babel-670, a benchmark comprising 670 languages representing 23 language families. Languages in Babel-670 run the gamut from the very high-resource to the very low-resource and are spoken on five continents. We then study the ability of ChatGPT (both GPT-3.5 and GPT-4) to (i) identify both language names and language codes, (ii) under both zero- and few-shot conditions, and (iii) with and without provision of a label set. When compared to smaller finetuned language identification tools, we find that ChatGPT lags behind. Our empirical analysis suggests that ChatGPT still needs substantial improvement before it can adequately serve diverse communities.
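
A sketch of the kind of zero-shot query involved, using the OpenAI Python client (the prompt wording is an illustrative assumption, not the paper's exact template):

    from openai import OpenAI

    client = OpenAI()

    def identify_language(text: str) -> str:
        """Ask for an ISO 639-3 code, zero-shot, with no label set provided."""
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{
                "role": "user",
                "content": "What is the ISO 639-3 code of the language of the "
                           f"following text? Reply with the code only.\n\n{text}",
            }],
        )
        return response.choices[0].message.content.strip()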


Lego-MT: Learning Detachable Models for Massively Multilingual Machine Translation

Yuan, Fei, Lu, Yinquan, Zhu, WenHao, Kong, Lingpeng, Li, Lei, Qiao, Yu, Xu, Jingjing

arXiv.org Artificial Intelligence

Multilingual neural machine translation (MNMT) aims to build a unified model for many language directions. Existing monolithic models for MNMT face two challenges: parameter interference among languages and inefficient inference for large models. In this paper, we revisit the classic multi-way structures and develop a detachable model by assigning each language (or group of languages) to an individual branch that supports plug-and-play training and inference. To learn representations for all languages in a unified space, we propose a novel, efficient training recipe, upon which we build an effective detachable model, Lego-MT. For a fair comparison, we collect data from OPUS and build a translation benchmark covering 433 languages and 1.3B parallel sentence pairs. Experiments show that Lego-MT, with 1.2B parameters, brings an average gain of 3.2 spBLEU. It even outperforms M2M-100 with 12B parameters. The proposed training recipe brings a $28.2\times$ speedup over the conventional multi-way training method (code: https://github.com/CONE-MT/Lego-MT).
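
A conceptual PyTorch sketch of the detachable-branch idea (module names and layers are stand-ins, not the actual Lego-MT architecture):

    import torch.nn as nn

    class DetachableMT(nn.Module):
        """Each language owns an encoder/decoder branch around a shared core;
        only the branches for the current direction need to be loaded."""
        def __init__(self, d_model=512):
            super().__init__()
            self.encoders = nn.ModuleDict()  # per-language encoder branches
            self.decoders = nn.ModuleDict()  # per-language decoder branches
            self.core = nn.Linear(d_model, d_model)  # stand-in for shared layers

        def add_language(self, lang, d_model=512):
            self.encoders[lang] = nn.Linear(d_model, d_model)
            self.decoders[lang] = nn.Linear(d_model, d_model)

        def forward(self, x, src_lang, tgt_lang):
            return self.decoders[tgt_lang](self.core(self.encoders[src_lang](x)))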


ParaNames: A Massively Multilingual Entity Name Corpus

Sälevä, Jonne, Lignos, Constantine

arXiv.org Artificial Intelligence

We introduce ParaNames, a multilingual parallel name resource consisting of 118 million names spanning 400 languages. Names are provided for 13.6 million entities, which are mapped to standardized entity types (PER/LOC/ORG). Using Wikidata as a source, we create the largest resource of this type to date. We describe our approach to filtering and standardizing the data to provide the best quality possible. ParaNames is useful for multilingual language processing, both in defining tasks for name translation/transliteration and as supplementary data for tasks such as named entity recognition and linking. We demonstrate an application of ParaNames by training a multilingual model for canonical name translation to and from English. Our resource is released under a Creative Commons license (CC BY 4.0) at https://github.com/bltlab/paranames.
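
A short sketch of extracting English-to-target training pairs for such a model (column names are again assumptions about the release format):

    import csv
    from collections import defaultdict

    def name_pairs(path, target_lang):
        """Yield (English name, target-language name) pairs per Wikidata entity."""
        by_entity = defaultdict(dict)
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f, delimiter="\t"):
                by_entity[row["wikidata_id"]][row["language"]] = row["name"]
        for names in by_entity.values():
            if "en" in names and target_lang in names:
                yield names["en"], names[target_lang]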


Translating with Google Sheets

#artificialintelligence

Google Translate is an amazing feat of engineering that uses artificial intelligence to translate speech and text from a chosen language into another. In most cases, Google Translate's own interface, embedded in Google Search or available at translate.google.com, is all you need. But again, as in other case studies presented here, Google Sheets comes to the rescue! Other than formatting the file to your liking, you can create drop-down lists for the source and target languages. This will help you be more productive, as you will not need to search for the language codes every time you want to change them. In my case, I used the Data Validation feature with a List from a Range as the criterion.
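
For instance, the built-in GOOGLETRANSLATE function can reference those drop-down cells directly (the cell addresses are illustrative, assuming the source and target codes sit in B1 and C1):

    =GOOGLETRANSLATE(A2, $B$1, $C$1)

Dragging the formula down the column then translates every source cell with the currently selected language pair.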


Build a Translation Application with AWS

#artificialintelligence

Amazon's suite of ML services is constantly expanding. From building custom ML pipelines in SageMaker to a versatile set of AutoML services, the options for deploying solutions and tackling ML problems are nearly limitless. Neural machine translation is a theoretically intense field that ordinarily requires deep knowledge of LSTMs and deep learning frameworks such as TensorFlow and PyTorch. In this article we will explore AWS Translate, a neural machine translation tool that supports 71 languages and lets you build applications with a simple API call. This article is a continuation of the Auto-ML on AWS series; check out the Rekognition and Comprehend articles for the first two parts.
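
A minimal sketch of that API call using boto3 (the region and strings are placeholders):

    import boto3

    # create an AWS Translate client
    translate = boto3.client("translate", region_name="us-east-1")

    result = translate.translate_text(
        Text="Hello, world!",
        SourceLanguageCode="en",
        TargetLanguageCode="es",
    )
    print(result["TranslatedText"])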